ABSTRACT

Prior work has shown that the list of apps installed by a
user reveals a lot about the user's interests and behavior.
These works rely on the semantics of the installed apps and
show that various user traits can be learnt automatically
using off-the-shelf machine-learning techniques. In this work,
we focus on the re-identifiability issue and thoroughly study
the unicity of smartphone apps on a dataset containing 54,893
Android users collected over a period of 7 months. Our
study finds that any 4 apps installed by a user are enough
(more than 95% of the time) to re-identify the user in
our dataset. As the complete list of installed apps is unique
for 99% of the users in our dataset, it can easily be used
to track or profile users by a service such as Twitter that
has access to the whole list of installed apps. As our
dataset is small compared to the total population of
Android users, we also study how unicity would vary with
larger datasets. This work emphasizes the need for better
privacy safeguards against the collection, use and release
of the list of installed apps.

Categories and Subject Descriptors: K.4 [Public Policy 
Issues]: Privacy 


1. INTRODUCTION 

People are unique in what they are and how they look. They
can easily be identified from their DNA sequences,
fingerprints, iris scans, web browsers and so on. A combination
of attributes such as age, address or religion also makes
them unique. Recently, some studies have shown that people
are also unique in the way they behave. For example,
de Montjoye et al. illustrated this behavioural uniqueness by
showing that people are unique in the way they move around.
In fact, they show that only four spatio-temporal positions
are enough to uniquely identify a user 95% of the time in a
dataset of one and a half million users. Similarly, other
studies showed that people are unique in the way they
purchase goods online, configure their browser or browse the
web [11].


As smartphones are now widely deployed all over the world
and the list of installed/running applications (apps) on them
is readily accessible, the threat to user privacy is huge if
this data is collected, used and released without sufficient
diligence. This threat comes in two flavors: first, the
semantics of the installed apps can tell a lot about the
users' habits and interests; second, the unicity of installed
apps could make a user re-identifiable if this dataset is
released. Regarding the first threat, [13] showed that user
traits such as religion, relationship status, spoken languages,
countries of interest, and whether or not the user is a parent
of small children can easily be predicted from the list, or
even just the categories, of the apps installed on a
smartphone. In this paper, we focus on the second threat and
measure the unicity of installed apps in order to quantify the
risk of re-identification if app datasets are released publicly
or shared between two entities.


It has been widely reported recently that Twitter has started
to collect the list of apps that a user has installed. Twitter
claims to use this information for targeted interest-based
advertising, among other purposes. However, it might be a
privacy concern if Twitter shares this list of installed apps
with an advertising company, even in pseudo-anonymized form,
i.e., after removing all direct user identifiers (and even if
app names are replaced with their hashes). This is because the
advertising company might implicitly know a subset of, say, K
installed apps of a user, namely those in which its ad library
is present. So if K apps are enough to uniquely identify a user
in the dataset, the advertiser would be able to re-identify the
user in the Twitter dataset and hence learn about all the other
installed apps of that user. Knowing this whole list, the
advertiser can learn about that user's interests and habits (as
demonstrated in [13]) and, consequently, might be able to
deliver targeted ads directly in the apps in which its library
is present. We believe that this is a real privacy threat to
smartphone users (both Android and iOS) today, as apps running
on these OSs can access the list of installed/running apps.
Note that Android apps do not require any permission to access
the whole list of installed apps. On iOS, Apple does not
provide a public API to access the list of installed apps, but
apps can get the list of currently running apps at any time.
If an app frequently scans the currently running apps over a
period of time, the resulting list can converge very quickly to
the list of installed apps.

^1 http://goo.gl/OOFbDO





Contributions: The main contributions are as follows.

• We show that 99% of the lists of installed applications
are unique among a total of about 55 thousand users.
Moreover, as few as two applications are sufficient for an
adversary to identify an individual's application list with
a probability of 0.75 in our dataset. The re-identification
probability increases to almost 0.95 if the adversary knows
4 apps. We stress that these results were obtained without
considering any system apps (which are common to all
users), and apps were identified only by the hash of their
names. Incorporating additional information into their
identifiers, such as app version, time of installation,
etc., would increase these probabilities even more.

• We propose an unbiased estimate of the real uniqueness of
any subset of applications, i.e., the probability that a
randomly selected subset of apps with cardinality K is
unique in the dataset. For this purpose, we use a Markov
Chain Monte Carlo method to sample subsets of applications
from a dataset uniformly at random. We prove that this
chain is generally fast-mixing with most practical
datasets, i.e., it has a running time complexity which is
roughly linear in the dataset size and K. This result
might be of independent interest, as this technique can be
used to sample subsets with arbitrary cardinality from any
set-valued dataset.

• We attempt to predict the uniqueness of lists of
applications in larger datasets using standard non-linear
regression analysis. Our learned model performs well on
our limited app dataset as well as on mobility data with a
sufficiently large number of users. However, in the case of
mobility data, we find that the model does not predict well
if it is trained with smaller datasets, e.g., of the size
of our app dataset. Therefore, we conclude that our app
dataset is probably too small to accurately predict the
uniqueness in a larger dataset such as all Android users
worldwide.

2. UNICITY AS A MEASURE OF RE-IDENTIFIABILITY

Let $A$ denote the universe of all apps, where each application
is represented by a unique identifier in $A$. A dataset
$D \subseteq 2^{A} \setminus \{\emptyset\}$ is the ensemble of
all apps of some set of individuals, where $|D|$ denotes the
number of individuals in $D$. A record $D_u$, which is a
non-empty subset of $A$, refers to all apps of an individual $u$
in $D$. A set of applications with cardinality $K$ is called
K-apps henceforth. The set of all K-apps over $A$ is denoted as
$A^K$.

Definition 1 (Unicity) Let $\mathrm{supp}(x, D)$ denote the
support of $x \in A^K$ in dataset $D$, i.e., the number of
records in $D$ which contain $x$. Then,

$$H_1 = \frac{|\{x : x \in A^K \wedge \mathrm{supp}(x, D) = 1\}|}{|\{x : x \in A^K \wedge \mathrm{supp}(x, D) \geq 1\}|}$$

is defined as the unicity (or uniqueness) of K-apps in $D$.

The unicity of K-apps is the relative frequency of K-apps
which are contained in only a single record. In general, the
relative abundance distribution (RAD) is a relative frequency
histogram $H = (H_1, H_2, \ldots, H_n)$ of K-apps with respect
to a dataset $D$, where $H_i$ denotes the relative frequency of
K-apps which are contained in exactly $i$ records of $D$, i.e.,

$$H_i = \frac{|\{x : x \in A^K \wedge \mathrm{supp}(x, D) = i\}|}{|\{x : x \in A^K \wedge \mathrm{supp}(x, D) \geq 1\}|}$$

Unicity is strongly related to re-identifiability, and we use
it as a measure of privacy in this paper: it is the
probability that an adversary who only knows $K$ applications
installed on a user's device can single out the record of
this user in $D$. Indeed, any K-apps which is unique in $D$
can be used as personally identifiable information (PII) of
its record owner. Specifically, if the adversary knows such a
K-apps, it can easily identify the corresponding record and
retrieve all the applications installed by its owner, even if
$D$ is pseudo-anonymized (i.e., does not contain any direct
PII such as device ID or personal name). Therefore, large
unicity usually indicates a serious privacy risk in practice.
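
As a small illustration of Definition 1, the following sketch computes the RAD and the unicity exactly by enumerating every K-apps of every record. It is only feasible for tiny inputs; the function name `rad_and_unicity` and the toy records are assumptions made for this example and do not come from the analyzed dataset.

```python
from collections import Counter
from itertools import combinations


def rad_and_unicity(records, k):
    """Exact RAD and unicity of k-apps by brute-force enumeration.

    records: list of sets of app identifiers (one set per user).
    Returns (rad, unicity), where rad[i] is the fraction of distinct
    k-apps contained in exactly i records and unicity is rad[1].
    """
    support = Counter()
    for record in records:
        # Sorting gives every k-apps a canonical tuple representation.
        for subset in combinations(sorted(record), k):
            support[subset] += 1
    total = len(support)  # number of k-apps with support >= 1
    if total == 0:
        return {}, 0.0
    histogram = Counter(support.values())
    rad = {i: count / total for i, count in sorted(histogram.items())}
    return rad, rad.get(1, 0.0)


# Hypothetical toy dataset: four users, five apps.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "d"}, {"c", "d", "e"}]
rad2, unicity2 = rad_and_unicity(toy, 2)
rad1, unicity1 = rad_and_unicity(toy, 1)
```

In this toy dataset, 6 of the 7 distinct 2-apps occur in a single record, while only 1 of the 5 individual apps does, which mirrors the observation that unicity grows with K.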

3. APPROXIMATING UNICITY WITH SAMPLING

To compute unicity, and RAD in general, the support of
all different K-apps in $D$ should be calculated. However,
this is usually prohibitively expensive in practice. Therefore,
like previous works, we rely on sampling to estimate
unicity. In particular, consider the set of all K-apps
which occur in at least one individual's record, i.e.,
$\{x : x \in A^K \wedge \mathrm{supp}(x, D) \geq 1\}$. We
randomly sample a set $V$ of K-apps from this set, and
approximate the real unicity $H_1$ by the sample unicity

$$\hat{H}_1 = \frac{|\{x : x \in V \wedge \mathrm{supp}(x, D) = 1\}|}{n},$$

where $V$ is the sample set and $n = |V|$ is the sample size.

3.1 Biased vs. unbiased estimation of unicity 

How should we sample K-apps from the dataset? A popular
technique, which has been used in several works, first
samples a user uniformly at random in $D$, and then a set
of $K$ applications from this user's record, also uniformly at
random. However, this simple technique provides a biased
estimate of the unicity in Definition 1 if the estimator
remains the sample mean $\hat{H}_1$, since
$E[\hat{H}_1] \neq H_1$. In fact, K-apps which occur in more
records of $D$ are more likely to be selected by this approach
(assuming records have similar sizes). As a result, this
sampling method is biased towards more popular K-apps, and the
measured unicity underestimates the real unicity $H_1$ that one
would get with an unbiased estimator of $H_1$. Such an unbiased
estimator can be the sample unicity $\hat{H}_1$ of K-apps which
are sampled truly uniformly at random from $D$. This is also
illustrated in Figure 4, where the sample unicity of biased and
unbiased (i.e., uniform) samples is reported.
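
The biased procedure described above can be sketched as follows. This is a minimal illustration under assumptions: the helper names and the toy records are hypothetical, and support is computed by a plain scan rather than by any optimized index.

```python
import random


def biased_sample(records, k, rng):
    """Pick a user uniformly at random, then k of that user's apps uniformly."""
    eligible = [r for r in records if len(r) >= k]
    record = rng.choice(eligible)
    # random.sample needs a sequence, so the set is sorted into a list first.
    return frozenset(rng.sample(sorted(record), k))


def support(kapps, records):
    """Number of records that contain every app of kapps."""
    return sum(1 for r in records if kapps <= r)


def biased_unicity(records, k, n, seed=0):
    """Sample unicity over n K-apps drawn with the biased procedure."""
    rng = random.Random(seed)
    unique = sum(
        support(biased_sample(records, k, rng), records) == 1 for _ in range(n)
    )
    return unique / n


# Hypothetical toy dataset: the true unicity of 2-apps is 6/7 (about 0.857).
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "d"}, {"c", "d", "e"}]
estimate = biased_unicity(toy, 2, 20000)
```

On this toy input the biased estimator converges to 2/3 rather than 6/7, because the 2-apps {a, b}, which is shared by two users, is proposed far more often than any unique 2-apps.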

Before describing our unbiased estimation of the unicity $H_1$,
we shed some light on the privacy semantics behind the two
sampling approaches. The biased technique approximates
the success probability of an adversary who is more likely to
know popular K-apps from the application set of any user.
For instance, continuing the case of the advertiser from
Section 1, the advertiser's library should be more likely to
be used by popular apps (such as Facebook, Twitter, etc.),
which are installed on many devices, than by other, less
popular apps. However, this is not necessarily true and,
in general, an advertiser's library can be included in any
K-apps of a user. In the rest of the paper, we assume that the
adversary can learn any K-apps of any user in $D$ with equal
probability, which is the most general assumption in practice.
Therefore, we are interested in an unbiased estimator
of $H_1$.

^2 This term is often used in the field of ecology to describe
the number of observed species as a function of their observed
abundance.

3.2 Uniform sampling of K-apps

An unbiased estimate of $H_1$ is obtained if $\hat{H}_1$ is
computed over a sample set in which each K-apps appears with
equal probability. Hence, our task is to sample an element
uniformly at random from the set of K-apps occurring in $D$,
for any $K$. A first (naive) approach could be to use rejection
sampling, i.e., sample a candidate K-apps from $A^K$ uniformly
at random, and then accept this candidate as a valid sample
only if it also occurs in $D$. Otherwise, repeat the process
until a candidate is accepted. Although sampling a candidate
from $A^K$ is straightforward, it is very likely to be
non-existent in $D$ (especially if $K$ is large), and hence the
running complexity of this approach is $O(|A|^K)$ in the worst
case. An alternative approach could be to enumerate all K-apps
occurring in $D$ and choose one of them directly, uniformly at
random. However, the complexity of this approach is still
$O(|D| (\max_u |D_u|)^K / K!)$. Unfortunately, these naive
methods provide acceptable performance only if $K$ is small. As
Table 1 shows, in our dataset, $|A| = 92210$,
$\max_u |D_u| = 541$, $|D| = 54893$, and we wish to estimate
the unicity when $1 \leq K \leq 10$.

We instead propose a sampling technique based on the
Metropolis-Hastings algorithm [10], which is a Markov
Chain Monte Carlo (MCMC) method. Our proposal has
a worst-case complexity of only $O(K|D|/H_1^*)$, where $H_1^*$
is roughly the unicity of K-apps in $D$. As the unicity of
K-apps is large, especially if $K$ is large, the complexity is
approximately $O(K|D|)$ in practice. Hence, our sampling
technique remains reasonably fast even for larger values of $K$.

In particular, we construct an ergodic Markov chain, denoted
by $\mathcal{M}$, such that its stationary distribution $\pi$
is exactly the distribution that we want to sample from, that
is, the uniform distribution over the K-apps occurring in $D$.
Each such K-apps corresponds to a state of $\mathcal{M}$, and
we simulate $\mathcal{M}$ until it gets close to $\pi$, at
which point the current state of $\mathcal{M}$ can be
considered as a sample from $\pi$. $\mathcal{M}$ is detailed in
Algorithm 1. At each state transition, $\mathcal{M}$ picks a
candidate next state $C$ independently of the current state $S$
(in Lines 6-7). In Line 8, the candidate is either accepted
(and $\mathcal{M}$ moves to $C$) or rejected with a certain
probability (in which case the candidate state is discarded,
and $\mathcal{M}$ stays at $S$). The main idea is that, at each
state, we use the fast but biased sampling technique described
in Section 3.1 to propose a candidate $C$ (in Lines 6-7). We
correct this bias by adjusting the acceptance/rejection
probability (in Line 8) accordingly: $\mathcal{M}$ is more
likely to accept K-apps which are less likely to be proposed in
Lines 6-7. Indeed, as $\pi(S) = \pi(C)$, the probability of
acceptance is

$$\min\left(1, \frac{\pi(C)\,\Pr[S \text{ is proposed}]}{\pi(S)\,\Pr[C \text{ is proposed}]}\right) = \min\left(1, \frac{\Pr[S \text{ is proposed}]}{\Pr[C \text{ is proposed}]}\right) = \min(1, q(S)/q(C)).$$

A more formal analysis is described in the Appendix. The proofs
of all the theorems in this paper can also be found in the
Appendix.


Algorithm 1 MCMC sampling ($\mathcal{M}$)

1: Input: Dataset $D$, $K$, number of iterations $t$
2: Output: A sample $S$ from the K-apps occurring in $D$
3: Let $U := \{D_u : |D_u| \geq K \wedge D_u \in D\}$
4: Let $S$ be an arbitrary K-apps occurring in $D$
5: for $k = 1$ to $t$ do
6:     Select an individual $u \in [1, |U|]$ uniformly at random
7:     Select a subset $C \subseteq U_u$ uniformly at random such that $|C| = K$
8:     Let $S := C$ with probability $\min(1, q(S)/q(C))$, where $q(x)$ is the probability that $x$ is proposed in Lines 6-7
9: return $S$


Theorem 1 $\mathcal{M}$ in Algorithm 1 is an ergodic Markov
chain whose unique stationary distribution is the uniform
distribution over the K-apps occurring in $D$, for any $K$.
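
The chain can be sketched in a few lines of Python. This is an illustrative reading of Algorithm 1, not the authors' implementation: the proposal probability is taken to be $q(x) = \frac{1}{|U|} \sum_{u : x \subseteq U_u} 1/\binom{|U_u|}{K}$, which follows from the two-step proposal of Lines 6-7, and the function names and toy records are assumptions.

```python
import random
from math import comb


def make_proposal(records, k):
    """Return the eligible records and q(x), the probability that the
    biased two-step proposal (random user, then random k-subset) yields x."""
    eligible = [frozenset(r) for r in records if len(r) >= k]

    def q(x):
        return sum(1 / comb(len(r), k) for r in eligible if x <= r) / len(eligible)

    return eligible, q


def mcmc_sample(records, k, t, rng):
    """Run the independence Metropolis-Hastings chain for t steps and
    return its final state, an approximately uniform K-apps."""
    eligible, q = make_proposal(records, k)

    def propose():
        record = rng.choice(eligible)
        return frozenset(rng.sample(sorted(record), k))

    state = propose()  # arbitrary starting state that occurs in the data
    q_state = q(state)
    for _ in range(t):
        candidate = propose()
        q_candidate = q(candidate)
        # Accept with probability min(1, q(S)/q(C)): rarely proposed
        # candidates are accepted more readily, which removes the bias.
        if rng.random() < min(1.0, q_state / q_candidate):
            state, q_state = candidate, q_candidate
    return state


# Hypothetical toy dataset: the true unicity of 2-apps is 6/7.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "d"}, {"c", "d", "e"}]
rng = random.Random(1)
samples = [mcmc_sample(toy, 2, 30, rng) for _ in range(3000)]
unbiased_estimate = sum(
    sum(1 for r in toy if s <= r) == 1 for s in samples
) / len(samples)
```

Here `q` scans all eligible records, which is fine for a toy input; on a dataset of realistic size it would be computed from an inverted index instead.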


Convergence of $\mathcal{M}$. How should $t$ be set in
Algorithm 1? We prove that a "good" uniform sample from the
K-apps occurring in $D$ can be obtained after roughly $O(|D|)$
iterations in most practical cases. The time that $\mathcal{M}$
takes to converge to its stationary distribution $\pi$ is known
as the mixing time of $\mathcal{M}$, and is measured in terms
of the total variation distance between the distribution at
time $t$ and $\pi$.

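Definition 2 (Mixing time) For $\zeta > 0$, the mixing time
$t_{\mathcal{M}}(\zeta)$ of Markov chain $\mathcal{M}$ is

$$t_{\mathcal{M}}(\zeta) = \min\{t' : \|P^t_{\mathcal{M}} - \pi\|_{tv} \leq \zeta, \forall t \geq t'\}$$

where $\|P^t_{\mathcal{M}} - \pi\|_{tv} = \max_{x} \frac{1}{2} \sum_{y} |P^t_{\mathcal{M}}(x, y) - \pi(y)|$
defines the total variation distance, with $x$ and $y$ ranging
over the states of $\mathcal{M}$. $P^t_{\mathcal{M}}(x, y)$
denotes the t-step probability of going from state $x$ to $y$,
and $P^t_{\mathcal{M}}$ denotes the t-step probability
distribution over all states.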

The next theorem shows that the mixing time of $\mathcal{M}$ is
$O(|D| \log(1/\zeta) / H_1^*)$, where $|D|$ is the dataset size
and $H_1^*$ is the unicity of K-apps from the largest record of
$D$. As the unicity of K-apps is usually large in practice,
especially if $K$ is large, $\mathcal{M}$ is fast-mixing in
general. In our dataset $D$, $0.6 \leq H_1^* \leq 0.999$ for
$2 \leq K \leq 90$.

Theorem 2 (Mixing time of $\mathcal{M}$) Let $H_1^*$ denote the
probability that a randomly selected set of $K$ items from the
largest record (i.e., having the most apps) in $D$ is unique.
Then, $t_{\mathcal{M}}(\zeta) \leq |D| \ln(1/\zeta) / H_1^*$ for
any $K$.

^3 The unicity of K-apps from a single record can easily be
approximated with the Chernoff-Hoeffding bound of Section 3.3,
using uniform samples over all K-apps from the record. Like the
biased sampling in Section 3.1, this sampling is easy to
implement (e.g., by choosing $K$ items individually from the
record without replacement).

Figure 1: Convergence of our Markov chain $\mathcal{M}$. The
z-scores of 20 independent chains are plotted as a function of
the number of iterations $t$.

We emphasize that the bound in Theorem 2 is a worst-case
bound, and the real convergence time can be much smaller
depending on the dataset $D$ as well as the starting state of
the chain. As we show next, $\mathcal{M}$ indeed exhibits a
much smaller convergence time than its theoretical worst-case
bound for our dataset. We detected the convergence of
$\mathcal{M}$ using the Geweke diagnostic: if $X_t$ denotes a
Bernoulli random variable describing whether the current state
of $\mathcal{M}$ at time $t$ is unique, and
$\mathbf{X}_t = (X_1, X_2, \ldots, X_t)$, then we compute the
z-score

$$z = \frac{E[X_a] - E[X_b]}{\sqrt{Var(X_a) + Var(X_b)}}$$

where $X_a$ is the prefix of $\mathbf{X}_t$ (first 10%), and
$X_b$ is the suffix of $\mathbf{X}_t$ (last 50%). We declare
convergence when the z-score falls within $[-1, 1]$. Indeed, if
$X_a$ and $X_b$ become identically distributed (i.e., $X_a$ and
$X_b$ appear to be uncorrelated), the z values become normally
distributed with mean 0 and variance 1 by the central limit
theorem. We simulated 20 instances of $\mathcal{M}$, each
starting at a different state, and plotted the z-score of each
chain depending on the number of iterations $t$ in Figure 1.
This shows that convergence is detected after roughly 3000
steps in all chains for the different values of $K$. When this
happens, the current state can be taken as a valid sample.
Hence, in the sequel, we run $\mathcal{M}$ with $t = 3000$ to
obtain a uniform sample from the K-apps occurring in $D$.
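
A simplified version of this convergence check can be sketched as follows. Two assumptions are made here that the text does not spell out: the variances in the denominator are read as variances of the segment means, and they are estimated naively as the segment variance divided by the segment length, which ignores autocorrelation (a full Geweke diagnostic would use spectral density estimates). The function name `geweke_z` is hypothetical.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _variance(values):
    m = _mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def geweke_z(xs, first=0.1, last=0.5):
    """z-score comparing the mean of the first 10% of a chain trace with
    the mean of its last 50% (naive variance estimate, see lead-in)."""
    n = len(xs)
    prefix = xs[: max(1, int(first * n))]
    suffix = xs[n - max(1, int(last * n)):]
    diff = _mean(prefix) - _mean(suffix)
    var_of_means = _variance(prefix) / len(prefix) + _variance(suffix) / len(suffix)
    if var_of_means == 0:
        # Both segments are constant: identical means give 0, otherwise
        # the segments are maximally different.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(var_of_means)


# A trace whose prefix and suffix have the same mean, and one that drifts.
z_stationary = geweke_z([0, 1] * 500)
z_drifting = geweke_z([0] * 100 + [1] * 900)
converged = abs(z_stationary) <= 1
```

In practice `xs` would hold the 0/1 uniqueness indicators of the successive chain states, and the chain would be extended until the z-score stays within [-1, 1].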

We note that $q$ in Algorithm 1 can be computed rapidly
in practice by precomputing another dataset $T$, where each
record corresponds to an application in $D$, and record $i$
contains the sorted list of all users who have application $i$
in their record. Hence, the set of users who have a specific
K-apps in common can be computed easily by taking the
intersection of the corresponding records in $T$. The
complexity of this operation is $O(K |T_{\max}|)$, where
$|T_{\max}|$ is the maximum record size in $T$, i.e., the
number of users of the most popular application in $D$. Fast
implementations of the intersection of sorted integers have
been described in prior work.
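
The precomputed dataset $T$ is an inverted index from apps to sorted user lists. A minimal sketch follows, assuming users are identified by their position in the input list; the function names are hypothetical and the merge-based intersection is the textbook variant, not one of the optimized implementations mentioned above.

```python
def build_index(records):
    """Map each app to the sorted list of user ids whose record contains it."""
    index = {}
    for user_id, record in enumerate(records):
        for app in record:
            # User ids are visited in increasing order, so lists stay sorted.
            index.setdefault(app, []).append(user_id)
    return index


def intersect_sorted(a, b):
    """Intersection of two sorted lists by a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def users_with(kapps, index):
    """Sorted ids of the users whose record contains every app in kapps
    (kapps is assumed to be non-empty)."""
    # Starting from the shortest list keeps intermediate results small.
    lists = sorted((index.get(app, []) for app in kapps), key=len)
    result = lists[0]
    for other in lists[1:]:
        result = intersect_sorted(result, other)
        if not result:
            break
    return result


# Hypothetical toy dataset; the support of a K-apps is len(users_with(...)).
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "d"}, {"c", "d", "e"}]
index = build_index(toy)
```

With such an index, the support of a K-apps and the record sizes needed for $q$ are obtained from at most K list intersections instead of a scan over all records.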


3.3 Computing the sample size 

In order to compute the sample size, we use the
Chernoff-Hoeffding inequality on the tail distribution of the
sum of independent (but not necessarily identically
distributed) Bernoulli random variables. In particular, if
$X_i$ denotes a Bernoulli random variable describing the event
that the $i$th sampled K-apps is unique in $D$, then the
deviation of the estimator $\hat{H}_1 = \sum_{i=1}^{n} X_i / n$
from $E[\hat{H}_1] = H_1$ is given by
$\Pr[|\hat{H}_1 - H_1| \geq \varepsilon] \leq 2e^{-2n\varepsilon^2}$,
or equivalently,

$$\Pr[|\hat{H}_1 - H_1| < \varepsilon] \geq 1 - 2e^{-2n\varepsilon^2} \qquad (1)$$

where $\varepsilon$ is the sampling error and
$\sigma = 1 - 2e^{-2n\varepsilon^2}$ is the confidence. Hence,
we obtain that

$$n \geq \frac{1}{2\varepsilon^2} \ln\frac{2}{1 - \sigma} \qquad (2)$$

For example, if $\varepsilon = 0.01$ and $\sigma = 0.99$, we
need to sample at least 26492 K-apps from $D$ (with
replacement) to guarantee that $|\hat{H}_1 - H_1| < 0.01$ with
probability at least 0.99.


Considering RAD, suppose we aim at approximating the
first $k$ relative frequency values of $H$, i.e.,
$(H_1, H_2, \ldots, H_k)$. Therefore, we wish to
simultaneously satisfy Inequality (1) for each $H_i$
($1 \leq i \leq k$), where
$\hat{H}_i = \sum_{j=1}^{n} X_j / n$, and $X_j = 1$ if the
$j$th sampled K-apps occurs in exactly $i$ records of $D$,
otherwise $X_j = 0$. Hence,

$$\Pr\left[\bigwedge_{i=1}^{k} |\hat{H}_i - H_i| < \varepsilon\right] \geq 1 - \sum_{i=1}^{k} \Pr\left[|\hat{H}_i - H_i| \geq \varepsilon\right] \geq 1 - 2k e^{-2n\varepsilon^2}$$

where $\sigma = 1 - 2k e^{-2n\varepsilon^2}$ is the confidence.
Therefore,

$$n \geq \frac{1}{2\varepsilon^2} \ln\frac{2k}{1 - \sigma} \qquad (3)$$

For instance, for $\varepsilon = 0.01$, $\sigma = 0.99$, and
$k = 10$, we need to sample at least 38005 K-apps from $D$
(with replacement). This will guarantee that
$|\hat{H}_i - H_i| < 0.01$ for all $1 \leq i \leq k$ with
probability at least 0.99.
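
The two sample sizes quoted in this section can be checked numerically. The sketch below solves $\sigma = 1 - 2k e^{-2n\varepsilon^2}$ for $n$, which gives $n \geq \ln(2k/(1-\sigma)) / (2\varepsilon^2)$, and rounds up; $k = 1$ corresponds to estimating the unicity alone. The function name `sample_size` is an assumption made for this example.

```python
import math


def sample_size(eps, sigma, k=1):
    """Smallest integer n with 2k * exp(-2 * n * eps^2) <= 1 - sigma.

    eps:   maximum sampling error
    sigma: confidence
    k:     number of RAD values estimated simultaneously (union bound)
    """
    return math.ceil(math.log(2 * k / (1 - sigma)) / (2 * eps ** 2))


n_unicity = sample_size(0.01, 0.99)      # unicity only
n_rad = sample_size(0.01, 0.99, k=10)    # first 10 RAD values
```

Because $n$ grows only logarithmically in $k$, estimating ten RAD values at the same error and confidence requires fewer than 1.5 times the samples needed for the unicity alone.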


4. EVALUATION 

4.1 Dataset characteristics 

The analyzed dataset comes from the Carat research project
[12]. The dataset includes data from 54,893 Carat Android
users between March 11, 2013 and October 15, 2013 [15].
During this period, the Carat app collected the list of
running apps (and not the list of all installed apps) on users'
devices whenever the battery level changed. As collecting the
list of running apps multiple times over more than 7 months is
likely to add up to the set of all installed apps of a user,
we consider a record to be the set of installed applications in
this paper, even if a record might not always be the complete
set of installed apps.

^4 http://carat.cs.helsinki.fi

Figure 2: Cumulative distribution of installed apps



Figure 3: Probability distribution of number of records
containing a particular app


We removed system apps from all records because they are
common to all users. Without system apps, our analyzed
dataset contains 92,210 different applications, whereas the
total number of apps available on Google Play was around 1
million during this time. Furthermore, the average number
of apps installed per user in our dataset is 42, with a
standard deviation of 39. Table 1 summarizes the main
characteristics of our dataset $D$.

Figure 2 depicts the cumulative distribution of the number of
apps installed by a particular user. We note that more than
90% of users have 100 or fewer applications. The probability
distribution of the number of users who installed a particular
app is depicted in Figure 3. Notice that more than half of
the apps are contained in only a single record in $D$.


Ethical Considerations. The data were collected with the
users' consent, and they were explicitly informed that their
data could be used and shared for various research projects.
In fact, the Carat privacy policy (available at
http://carat.cs.helsinki.fi) clearly specifies that "Carat is a
research project, so we reserve the right to publish our
results online and in academic publications. We also reserve
the right to release the data sets into the public domain."
Also, the dataset was shared with us by the Carat team in
a pseudo-anonymised form. In particular, identifiers were
removed, and each application name was replaced with its
SHA-1 hash. It contained 54,893 records [15], i.e., one record
per user. Each record is composed of the list of applications
installed by the user. Furthermore, the data sharing
agreement that we signed stipulated that we cannot use the
dataset to deanonymize the users in the dataset.

^5 http://en.wikipedia.org/wiki/Google_Play

Dataset size $|D|$                       54,893
Number of all apps in $D$                92,210
Maximum record size $\max_u |D_u|$       541
Minimum record size $\min_u |D_u|$       1
Average record size                      42
Std. dev. of record size                 39

Table 1: Characteristics of our dataset $D$

4.2 Results 

We find that 98.93% of users have a unique set of installed
apps in $D$, i.e., there does not exist any other user with
the same set of installed apps. This means that if we know
the list of all the installed apps of a user in the dataset, we
can identify that user in the dataset with a probability of
0.99. As the adversary might not always be aware of all the
installed apps of a user in practice, we measure the unicity
of K-apps for different values of $K$ using our dataset $D$.



Figure 4: Uniqueness probability as a function of K for
biased and unbiased sampling

Figure 4 gives the unicity of K-apps for different values of
$K$ (ranging from 1 to 10) for the two types of sampling
techniques described in Section 3: the biased sampling used in
previous works and our unbiased, uniform sampling described
in Section 3.2. In each case, we computed the sample size
using Inequality (2) with maximum sampling error
$\varepsilon = 0.01$ and confidence $\sigma = 0.99$. Unless
stated otherwise, we use this sample size in the sequel. This
results in 26492 samples for each value of $K$. As biased
sampling favours more popular K-apps, its sample unicity
$\hat{H}_1$ is lower than with our unbiased approach. In
particular, the difference can be as large as 0.5 for smaller
values of $K$, while it decreases as $K$ increases. For the
unbiased estimation, the sample unicity is 0.75 with $K = 2$,
and it reaches 0.99 when $K = 6$.

Figure 4 shows that the unicity of any K-apps is large, and
hence there would be a real privacy threat if such a dataset
was released. Moreover, Figure 5 depicts the relative
abundance distribution in $D$ when $1 \leq K \leq 8$. RAD
provides complementary information about users' privacy in
$D$. In particular, even if the adversary cannot single out the
record of the target user in $D$, it might still learn new
information about him/her. For example, if the known K-apps of
the target user are shared by multiple users in $D$ and all
these users have some identical apps besides the known K-apps,
then the adversary learns that the target user also has these
apps installed on his/her phone. This attack is often
referred to as the homogeneity attack in the literature. We
computed the required sample size using Inequality (3) with
$\varepsilon = 0.01$ and $\sigma = 0.99$ for $k = 20$. This
gives 41470 samples overall, which were taken with our uniform
sampler $\mathcal{M}$.



Figure 5: Relative abundance distribution of apps for
different sizes of sets of apps

To study the effect of the number of users on unicity, we
randomly select subsets of users of different sizes from $D$,
and calculate the sample unicity within these subsets. Figure 6
depicts how unicity changes with the number of users in our
dataset. We find that unicity decreases as the number of users
increases. However, this decrease becomes less significant
for larger numbers of users. This is probably due to the fact
that the number of apps starts to saturate as the number of
users increases.

As the size of our dataset is much smaller than the population
of all Android users worldwide (which was roughly 1
billion as of 2014, with 1.2 million different applications
available on Google Play), we aim at predicting the unicity
in a larger dataset (possibly in the whole population) in the
next section.

5. UNICITY GENERALIZATION FOR 
LARGER DATASETS 

^6 http://www.engadget.com/2014/06/25/google-io-2014-by-the-numbers/

^7 http://www.appbrain.com/stats/number-of-android-apps

Information surprisal can be used to measure uniqueness
in the population. In our case, the population is all the
Android users worldwide, to which we want to generalize our
results. As the information surprisal of any K-apps
$\{A_1, A_2, \ldots, A_K\}$ over $D$ is equal to
$-\log(\Pr[A_1, A_2, \ldots, A_K])$, we first need to measure
the co-occurrence probability $\Pr[A_1, A_2, \ldots, A_K]$ of
these apps in $D$. The co-occurrence probability can be easily
computed if we can assume that apps co-occur independently
in the dataset, as
$\Pr[A_1, A_2, \ldots, A_K] = \prod_{i=1}^{K} \Pr[A_i]$, where
$\Pr[A_i]$ (the popularity of app $A_i$) can be obtained from
the download count of $A_i$ available on the Google Play Store.
However, this independence assumption does not hold in a
real-world scenario, as there exists correlation between the
apps installed by a user. As our dataset is very likely to be
too limited to capture this correlation (it contains only about
92K distinct apps, whereas there are more than 1.2 million apps
available on Google Play), we cannot take this approach to
measure the uniqueness in the population of Android users. We
rather employ regression analysis on our dataset, which does
not rely on this correlation information, to predict the
unicity in a larger dataset.
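
For illustration only, the independence-based estimate that this paragraph describes and then sets aside can be written in a few lines. The logarithm base is not specified in the text; base 2 (bits) is an assumption here, as are the function name and the example popularities.

```python
import math


def surprisal_bits(popularities):
    """Surprisal of a K-apps under the (unrealistic) assumption that apps
    are installed independently: -log2 of the product of popularities."""
    # Summing logs avoids underflow when many small probabilities multiply.
    return -sum(math.log2(p) for p in popularities)


# Hypothetical example: one app installed by half of all users and one
# installed by a quarter of them.
bits = surprisal_bits([0.5, 0.25])
```

Under independence, a K-apps with s bits of surprisal is expected to be shared by a fraction of roughly 2^(-s) of the population, which is why correlated installs make this estimate unreliable.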

For regression analysis, we randomly create datasets of
different sizes from our original dataset and compute the
sample unicity for each of them. This gives us tuples
$(x, y)$, where $x$ is the number of users in a particular
dataset (independent variable) and $y$ is the calculated
unicity value, which depends on $x$. Here, we assume that
the unicity value $y$ only depends on the number of users
$x$. We must note that, in reality, unicity depends on many
factors such as the characteristics of the users, how many
(un)popular applications users tend to have, etc. As it is
difficult to take all these factors into account, either
because they are unknown or because they are hard to measure,
we assume that unicity in general is a "proper" function of
only the total number of users in the dataset. That is, all
other factors are implicitly incorporated into the model,
i.e., the general form of the function.

Once we have these $(x, y)$ tuples, our goal is to select the
best model and its parameters that capture the relation
between $x$ and $y$. The overall approach is as follows: we
divide our $(x, y)$ tuples into training and test sets. We
select the best model (i.e., a function family) based on the
general characteristics of application unicity and then learn
its exact parameters using our training set. Finally, using
the best model thus obtained, we test its accuracy on the test
set. This model should be able to predict the unicity value for
a dataset of arbitrary size.

(a) K = 1, δ = 0.016   (b) K = 2, δ = 0.007   (c) K = 3, δ = 0.008
(d) K = 4, δ = 0.005   (e) K = 5, δ = 0.001   (f) K = 6, δ = 0.002

Figure 7: Unicity generalization for different values of K, all
trained with a maximum of 37000 users. The learnt models (i.e.,
f(x)) are shown in the legend; the x-axis corresponds to
normalized dataset sizes with a normalization factor of
1/54893, and the y-axis depicts sample unicity.

Training and Testing. We divide our original dataset into
54 smaller datasets, with sizes varying from 1k to 54k.
We take the first 70% of all $(x, y)$ points for training and
the last 30% (corresponding to larger datasets) for testing.
We deliberately use the points corresponding to larger
datasets as the test set because we aim to evaluate our model's
performance on larger datasets, i.e., we want to test how
accurately the learned model can be extrapolated.

As we divided our datasets by randomly selecting users out
of the original dataset, users in the training and test sets
may overlap. However, we found that unicity depends merely on
the number of users in the dataset and not specifically on the
underlying individuals. For example, we computed the unicity of
50 different sets of 1000 users selected randomly, and found
that the variance of the measured sample unicity is very small.

Model selection. To select our model, we first tried linear 
regression with non-linear basis functions (polynomials of 
various orders) with and without regularization. However, 
they provided very inaccurate predictions of unicity. Finally, 
we selected the following exponential model describing an 
exponential decay of unicity: 

\[ f(x) = a \cdot \exp(-b\sqrt{x}) + c \qquad (4) \]

The rationale behind choosing this model is as follows. The
figure on the number of apps shows that if additional users
were added to our dataset, the number of apps would reach the
maximum number of apps in the population early, as there are
fewer apps on Google Play than Android users. This suggests
that, after a certain point, additional users would not bring
many new apps, but they would still bring new combinations of
already existing apps. The addition of new combinations of
apps should increase unicity. However, the newly added users
can also decrease unicity, because they may share combinations
of apps with users already in the dataset. As these two
effects of adding new users run opposite to each other, we
suppose that unicity converges to a value greater than zero,
which is denoted by c in Equation (4). Indeed, as the unicity
figure shows, although unicity decreases as the number of
users increases, the amount of this decrease tends to decrease
as well. A similar observation was made in prior work. Also,
we used the square root of x in the exponent in Equation (4)
because taking the square root is variance-stabilizing. In
fact, we tried other powers of x in the exponent, but the
square root led to the best results.

https://en.wikipedia.org/wiki/Variance-stabilizing_transformation

Figure 8: Number of distinct apps installed by users

The goal of the regression is to compute the parameters a, b
and c in Equation (4) from the (x, y) tuples of the training
set. These parameters can be computed either by standard
non-linear regression directly or by first transforming
Equation (4) into linear form and then applying linear
regression. We use standard non-linear regression because it
explicitly computes the lower bound on the unicity value
(i.e., c in Equation (4)). The value of x is normalized by
dividing x by the maximum size of the dataset for which we
want to predict the unicity value.
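
As a rough illustration of fitting a, b and c, the sketch below uses a simple stand-in technique rather than the standard non-linear regression used in the paper: it scans a grid of candidate b values and, for each, solves for a and c in closed form by ordinary least squares. The data are synthetic, and the parameter values, grid, and function names are assumptions made for the example.

```python
import math


def model(x, a, b, c):
    """Exponential decay model f(x) = a * exp(-b * sqrt(x)) + c."""
    return a * math.exp(-b * math.sqrt(x)) + c


def fit_model(xs, ys, b_grid):
    """Least-squares fit: scan b, solve for a and c in closed form."""
    n = len(xs)
    y_mean = sum(ys) / n
    best = None
    for b in b_grid:
        g = [math.exp(-b * math.sqrt(x)) for x in xs]
        g_mean = sum(g) / n
        s_gg = sum((gi - g_mean) ** 2 for gi in g)
        if s_gg == 0.0:
            continue  # degenerate basis, a and c are not identifiable
        s_gy = sum((gi - g_mean) * (yi - y_mean) for gi, yi in zip(g, ys))
        a = s_gy / s_gg
        c = y_mean - a * g_mean
        sse = sum((a * gi + c - yi) ** 2 for gi, yi in zip(g, ys))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]


# Synthetic (x, y) points (assumption): 54 dataset sizes, 1k to 54k users,
# generated from hypothetical parameters a=0.3, b=2.0, c=0.6.
sizes = [1000 * i for i in range(1, 55)]
x_max = 54000  # largest dataset size we want to predict for
xs = [s / x_max for s in sizes]  # normalized x
ys = [model(x, 0.3, 2.0, 0.6) for x in xs]

split = int(0.7 * len(xs))  # first 70% of the points for training
a, b, c = fit_model(xs[:split], ys[:split], [i / 10 for i in range(1, 51)])
test_error = max(abs(model(x, a, b, c) - y)
                 for x, y in zip(xs[split:], ys[split:]))
print(f"a={a:.3f}, b={b:.1f}, c={c:.3f}, max test error={test_error:.2e}")
```

Because the synthetic points are noise-free and the true b lies on the grid, the fit recovers the generating parameters; with measured unicity values the grid would need to be finer or replaced by a proper non-linear solver.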

Results. As an error metric, we measure the average absolute
error, denoted by $\delta$, i.e.,

\[ \delta = \frac{1}{n} \sum_{i=1}^{n} |y_i - f(x_i)| \]

where n is the number of predicted points, $y_i$ is the real
unicity value, and $f(x_i)$ is the predicted value.
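
A minimal sketch of this error metric is given below; the function name and the sample values are hypothetical and only illustrate the arithmetic.

```python
def average_absolute_error(y_real, y_pred):
    """Mean of |y_i - f(x_i)| over all predicted points."""
    if not y_real or len(y_real) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(y - f) for y, f in zip(y_real, y_pred)) / len(y_real)


# Hypothetical real and predicted unicity values for three test points.
real = [0.90, 0.88, 0.87]
pred = [0.91, 0.90, 0.87]
delta = average_absolute_error(real, pred)
print(f"delta = {delta:.4f}")
```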

The corresponding figure presents how our exponential model
performs on the test set for different values of K. Although
our model can predict the trend of the unicity for a large
number of users, it slightly overestimates the real unicity in
the test set. Nevertheless, the average error $\delta$ on the
test set is only around 0.01. As our app dataset is very small
compared to the whole Android population, we cannot evaluate
the performance of our model for a large number of users,
e.g., a few million, or the whole Android population.
Therefore, we cannot claim that our model will accurately
predict the unicity for datasets with a large number of users,
even if it performs reasonably well on our test data.


Model validation on a different dataset. To further
demonstrate that our model is a meaningful approach to
predict unicity in large populations, we test it on a large
mobility dataset provided by a telecom operator in Europe.
This dataset contains the Call Data Records (CDR) of 1
million users from a large European city over 6 weeks. Each
record in the dataset corresponds to a user and contains the
set of cell towers he/she visited, where the total number of
different cell towers is 1303. From this dataset, we created
smaller datasets of different sizes (x ranging from 1000 to
1 million users). Then, we trained our model on the first 6
points (i.e., up to a dataset size of 50,000). The
corresponding figure shows that the model does not accurately
predict the unicity for large datasets, and the error can be
as large as 0.6. Next, we trained the model on the first 7
points (i.e., up to a dataset size of 75,000). In this case,
the model performs significantly better than in the previous
case, with an error of 0.13 on average. Finally, we trained
the model on the first 8 points (i.e., up to a dataset size of
100,000). In this case, the model makes accurate predictions
for larger datasets, e.g., for a test dataset of size 1 million
(10 times more than the maximum size of the dataset used
in the training phase), and the error is 0.05 on average.

https://en.wikipedia.org/wiki/Feature_scaling

We find that a mobility dataset of 0.1 million users is
sufficient to learn an accurate model and predict the unicity
values for a larger mobility dataset. However, as we saw
earlier, the model is not able to predict the unicity of a
large population well if it is trained on a dataset of only
50,000 users. This may suggest that 50,000 users might be too
few in general to learn an accurate model and, therefore, our
app dataset of 50,000 users might not be sufficient to learn
the model. On the other hand, even if this model performs well
on a mobility dataset with 1 million users, this does not
necessarily imply good performance on large application
datasets, due to the different data and user characteristics.
Nevertheless, these two experiments together show that our
exponential model can be a meaningful approach to predict
unicity in large populations.

6. CONCLUSION 

The paper shows that the list of installed applications is
quite unique. This result has a few implications for users'
privacy. First, since this metadata is unique, it could easily
be used to profile users, e.g., based on the category of
installed apps. This is what Twitter is doing to provide
interest-based targeted ads to users. Second, as a combination
of even a small number of installed apps is quite unique, this
information could be used to re-identify users in a dataset.
For example, if Twitter decided to publish the list of apps
installed by its users on their smartphones, it would be easy
for anyone who knows 4 or 5 apps of a given user to re-identify
that user and discover the other apps installed on his or her
smartphone. This makes anonymization of this information
challenging, and this is part of our future work.

In general, mobile users reveal many pieces of information
that, when combined, provide a lot of information about users
and can be used to build personalized profiles. Since people
are unique in many different known and unknown ways,
preserving the privacy of mobile users is very challenging.
New protection measures need to be devised.
APPENDIX

A. PROOF OF THEOREM 1

Metropolis-Hastings algorithm. Consider an ergodic
Markov chain $Q$ with transition graph $G(\Omega, E)$ and
transition matrix $P_Q$, where $\Omega$ is the finite state
space. For each $x \in \Omega$, let
$\kappa : \Omega \times \Omega \to [0,1]$ denote a (not
necessarily symmetric) proposal probability distribution
function such that for all $x \in \Omega$,
$\kappa(x,x) + \sum_{y : (x,y) \in E} \kappa(x,y) = 1$. The
transitions of $Q$ are defined according to the
Metropolis-Hastings rule as follows. From any state
$x \in \Omega$, first select a $y \in \Omega$ such that
$(x,y) \in E$ with probability $\kappa(x,y)$. Then, "accept"
the transition from $x$ to $y$ with probability
$\min\left(1, \frac{\pi(y)\kappa(y,x)}{\pi(x)\kappa(x,y)}\right)$;
otherwise stay at $x$. It is not difficult to show that such a
Markov chain $Q$ is reversible, i.e.,
$\pi(x)P_Q(x,y) = \pi(y)P_Q(y,x)$, and therefore its
stationary distribution $\pi$ is unique. Consequently, after
sufficiently many transitions, the distribution of states will
be very close to $\pi$. Notice that there is no need to compute
the normalization constant of $\pi$, even if $|\Omega|$ is very
large (i.e., an exponential function of K in our problem),
because it appears both in the numerator and denominator of
the transition probabilities.
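
The acceptance rule above can be illustrated with a small sketch. It is not the algorithm analyzed in this appendix: it assumes a toy state space of four states, an independence proposal $\kappa(x,y) = $ `proposal[y]` that does not depend on $x$, and unnormalized target weights; all names are hypothetical.

```python
import random
from collections import Counter


def metropolis_hastings(weights, proposal, steps, rng, start=0):
    """Run the chain and return the empirical visit frequency per state.

    weights  -- unnormalized target distribution pi (never normalized here)
    proposal -- proposal probabilities, kappa(x, y) = proposal[y]
    """
    states = list(range(len(weights)))
    x = start
    visits = Counter()
    for _ in range(steps):
        y = rng.choices(states, weights=proposal)[0]
        # pi(y) * kappa(y, x) / (pi(x) * kappa(x, y))
        ratio = (weights[y] * proposal[x]) / (weights[x] * proposal[y])
        if rng.random() < min(1.0, ratio):
            x = y  # accept the move; otherwise stay at x
        visits[x] += 1
    return [visits[s] / steps for s in states]


rng = random.Random(1)
# Uniform target over 4 states with a deliberately non-uniform proposal.
freqs = metropolis_hastings(weights=[1, 1, 1, 1],
                            proposal=[0.1, 0.2, 0.3, 0.4],
                            steps=200_000, rng=rng)
print([round(f, 3) for f in freqs])
```

Although the proposal favours some states, the acceptance step corrects for it, so the visit frequencies approach the uniform target without the normalization constant ever being computed.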


Proof of Theorem 1. In each iteration, $\mathcal{M}$ can select
any individual $u$ in $D$. Hence, at any state, $\mathcal{M}$
can visit any state in $\Omega$. Therefore, $\mathcal{M}$ is
connected and aperiodic. Also notice that

\[ \frac{\pi(C)\,\kappa(C,S)}{\pi(S)\,\kappa(S,C)}
   = \frac{\kappa(C,S)}{\kappa(S,C)}
   = \frac{\sum_{\forall u : U_u \supseteq S} \cdots}
          {\sum_{\forall u : U_u \supseteq C} \cdots}
   = \frac{q(S)}{q(C)} \]

where $\pi$ is the uniform distribution over $\Omega$.
Therefore, a candidate next state is accepted with probability
$\min\left(1, \frac{\pi(C)\kappa(C,S)}{\pi(S)\kappa(S,C)}\right)
= \min(1, q(S)/q(C))$, which means that $\mathcal{M}$ is
reversible and its unique stationary distribution is $\pi$
according to the Metropolis-Hastings rule. □

B. PROOF OF THEOREM 2

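In order to prove the mixing time of $\mathcal{M}$, we use a
standard coupling argument, which is described below.

Definition 3 (Coupling) A coupling of a Markov chain
$\mathcal{M}$ on state space $\Omega$ is a Markov chain on
$\Omega \times \Omega$ defining a stochastic process
$(X_t, Y_t)_{t=0}^{\infty}$ such that

• each of the processes $(X_t, \cdot)$ and $(\cdot, Y_t)$,
viewed in isolation, is a faithful copy of the Markov chain
$\mathcal{M}$ (given initial states $X_0 = x$ and $Y_0 = y$);
that is, $\Pr[X_{t+1} = b \mid X_t = a] = P_{\mathcal{M}}(a,b)
= \Pr[Y_{t+1} = b \mid Y_t = a]$; and
• if $X_t = Y_t$, then $X_{t+1} = Y_{t+1}$.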


Condition 1 ensures that each process, viewed in isolation,
is just simulating the original chain $\mathcal{M}$, and the
coupling is designed such that $X_t$ and $Y_t$ tend to
coalesce (i.e., move closer to each other according to some
notion of distance). Once they meet, Condition 2 guarantees
that they will move forward together. The time of this
coalescence can be used to upper bound the mixing time, as
shown by the next lemma.

Lemma 1 (Coupling lemma) Let $(X_t, Y_t)_{t=0}^{\infty}$ be a
coupling of a Markov chain $\mathcal{M}$. For initial states
$x, y$ let $T^{x,y} = \min\{t : X_t = Y_t \mid X_0 = x,
Y_0 = y\}$ denote the random variable describing the time
until $X_t$ and $Y_t$ coalesce. Then

\[ \|P_{\mathcal{M}}^t - \pi\|_{tv}
   \le \max_{x,y \in \Omega} \Pr[T^{x,y} > t] \]


Proof of Theorem 2. Define a coupling $(X_t, Y_t)$ as follows.
Let $X_t$ and $Y_t$ choose the same individual $u$ and subset
$C$ in Lines 6 and 7 of the algorithm, respectively. This is a
valid coupling according to Definition 3, since both $X_t$ and
$Y_t$ are exact copies of $\mathcal{M}$, and they move together
after they coalesce.

Let $p(x) = \kappa(\cdot, x)$ denote the probability that
$x = C$ is selected in Line 6 of the algorithm. Let $X_0 = x$
and $Y_0 = y$, and, w.l.o.g., $p(x) \le p(y)$. Due to the
coupling rule, $X_t$ and $Y_t$ can coalesce at any time, since
$P_{\mathcal{M}}(x,y) > 0$ for all $x, y \in \Omega$. This
happens when both $X_t$ and $Y_t$ select a state $z$ such that
$p(z) \le p(x) \le p(y)$, since $q(z) \le q(x) \le q(y)$ will
also hold. Let $U_{\max} = \max_u U_u$. For any
$x, z \in \Omega$, where $z$ occurs only in $U_{\max}$,
$p(z) \le p(x)$. Indeed,


p{x) 


1 

W\ 


E 



> 


1 1 

\U\ ^Tmaxl) 


>p{z) 


Hence, $X_t$ and $Y_t$ coalesce as soon as they select any
$z \in \Omega$ which occurs only in the largest record in $D$.
Therefore,

\[ \|P_{\mathcal{M}}^t - \pi\|_{tv}
   \le \max_{x,y \in \Omega} \Pr[T^{x,y} > t]
   \quad \text{(by Lemma 1)} \]

00 

<^(i-ur/|u|)*Pi7|u| 

i=t 

< (1-HT/|U|)‘ 

< exp(-ti7i*/|U|) 
which proves the theorem. □ 
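
The last steps of such a coupling bound follow a standard pattern, sketched here under an explicit assumption: $\rho$ is a hypothetical symbol (not from the original text) for a lower bound on the probability that the two copies coalesce in any single step. The time to coalescence is then dominated by a geometric random variable, and

\[ \Pr[T^{x,y} > t] \le \sum_{i=t}^{\infty} (1-\rho)^i \rho
   = (1-\rho)^t \le \exp(-\rho t), \]

where the equality is the geometric series and the last inequality uses $1 - \rho \le e^{-\rho}$. The bound therefore decays exponentially in $t$ at a rate given by the per-step coalescence probability.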

